Abstract. This paper discusses two new procedures for extracting verb valences from raw texts,
with an application to the Polish language. The first novel technique, the EM selection algorithm, 
performs unsupervised disambiguation of valence frame forests, obtained by applying a non-probabilistic 
deep grammar parser and some post-processing to the text. The second new idea concerns filtering of 
incorrect frames detected in the parsed text and is motivated by an observation that verbs which 
take similar arguments tend to have similar frames. This phenomenon is described in terms of newly 
introduced co-occurrence matrices. Using co-occurrence matrices, we split filtering into two steps. The 
list of valid arguments is first determined for each verb, whereas the pattern according to which the 
arguments are combined into frames is computed in the following stage. Our best extracted dictionary 
reaches an F-score of 45%, compared to an F-score of 39% for the standard frame-based BHT filtering.

1. Introduction



The aim of this paper is to explore two new techniques for verb valence extraction
from raw texts, as applied to the Polish language. The methods are novel compared
to the standard framework (Brent, 1993; Manning, 1993; Ersan and Charniak, 1995;
Briscoe and Carroll, 1997) and motivated in part by resources available for this language
and in part by certain linguistic observations.

The task of valence extraction for Polish indeed invites novel approaches. Although
there is no treebank for this language on which a probabilistic parser can be trained,
a few interesting resources are available. Firstly, the non-probabilistic parser Świgra
(Woliński, 2004; Woliński, 2005) provides an efficient implementation of the large formal
grammar of Polish by Świdziński (1992). Secondly, three detailed valence dictionaries
have been compiled by formal linguists (Polański, 1992; Świdziński, 1994; Bańko, 2000).
Those dictionaries are potentially useful as a gold standard in automatic valence
extraction, but two of them, Polański's and Bańko's, are printed on paper in several volumes,
whereas Świdziński's dictionary, though rather small, is available electronically. The
text file by Świdziński lists about 1000 verbal entries, whereas 6000 entries can be found
in COMLEX, a detailed syntactic dictionary of English (Macleod et al., 1994).

The information provided by Polish valence dictionaries is of comparable complexity 
to information available in COMLEX. Verbs in the dictionary entries select for nominal 
(NP) and prepositional (PP) phrases in specific morphological cases (7 distinct cases 
and many more prepositions). Valence frames may contain the reflexive marker się and
certain adjuncts (e.g., adverbs) but not necessarily a subject, which also contributes to
the combinatorial explosion. For instance, Świdziński (1994) provides 329 frame types
for the 201 test verbs described later in Section 4. The most frequent frame among
them, {np(nom), np(acc)}, is valid for 124 test verbs and there are 183 hapax frames. 



† The author is presently on leave at Centrum Wiskunde & Informatica, Science Park 123, NL-1098
XG Amsterdam, the Netherlands. E: debowski@cwi.nl T: +31 20 592 4193, F: +31 20 592 4312.

Such lack of computational data is a strong incentive to develop automatic valence
extraction as efficiently as possible. Thus we have devised two procedures. The first one,
called the EM selection algorithm, performs unsupervised selection of alternative valence
frames. These frames were obtained for sentences in a corpus by applying the parser
Świgra and some post-processing. In this way, we cope with the lack of a probabilistic
parser and of a treebank.

The EM selection procedure, to our knowledge described here for the first time, 
assumes that the disambiguated alternatives are highly repeatable atomic entities. The 
procedure does not rely on what formal objects the alternatives are but it only takes 
their frequencies into account. Thus, the EM selection looks like an interesting baseline
algorithm for many unsupervised disambiguation problems, e.g. part-of-speech tagging
(Kupiec, 1992; Merialdo, 1994). Computationally, the algorithm is far simpler than the
inside-outside algorithm for probabilistic grammars (Chi and Geman, 1998), which also
instantiates the expectation-maximization scheme and is used for treebank and valence
acquisition (Briscoe and Carroll, 1997; Carroll and Rooth, 1998).

The second novel technique concerns filtering of incorrect valence frames detected 
in the parsed text. Despite a large number of distinct frames occurring in the available 
Polish valence dictionaries, verbs which take similar arguments tend to have similar
frames. This phenomenon was surveyed in particular by Dębowski and Woliński (2007)
and their observations are reported here in more detail in Section 2. The cited authors
proposed that sets of verbal frames be described in terms of argument lists, which
strongly depend on a verb, and pairwise combination rules for arguments, called
co-occurrence matrices, which are largely independent of a verb.

In this article, we recall this formalism and propose an analogous two-stage approach 
to filtering incorrect frames. The list of arguments is filtered for each verb initially and 
then the co-occurrence matrices are processed. In both steps we use filtering methods 
that resemble those used so far for whole frames. We will show that verbal frames are 
easier to extract when decomposed into simpler entities than when treated as atomic 
objects. The qualitative analysis of errors is also easier to perform. 

Verb valence frames have been learned as atomic entities in all previous valence
extraction experiments (see also: Sarkar and Zeman, 2000; Przepiórkowski and Fast,
2005; Fast and Przepiórkowski, 2005; Chesley and Salmon-Alt, 2006), although recent
research exploits certain correlations among the verb meanings, diathesis, and
subcategorization (McCarthy, 2001; Korhonen, 2002, Chapter 4; Lapata and Brew, 2004;
Schulte im Walde, 2006). This line of computational experiment is more and more
inspired by formal research in semantic classes of arguments, verbs, and frame alternations,
cf. Levin (1993) and Baker and Ruppenhofer (2002).

Our unorthodox approach to decomposing valence frames, which is less resource- and
theory-intensive, stems from an independent insight into their distribution and structure,
built on the preliminary valence extraction experiment for Polish by Przepiórkowski and Fast
(2005). In that experiment, the F-score of the automatically extracted dictionary
reached about 40%, whereas the F-score of two gold-standard dictionaries by Polański
(1992) and Bańko (2000) compared with each other equalled 65%. This apparently low
agreement between manually compiled dictionaries and the lack of explicit information
about semantic classes inspired us to seek other patterns in valence frames and to
develop an alternative extraction scheme.

The experiment described in this paper differs from both of the works by
Przepiórkowski and Fast in several respects. Firstly, we explore whether it is better
to filter frames in two steps or in one step, as done previously. Secondly, we extract
all kinds of arguments occurring in the gold-standard dictionaries, whereas only
non-subject NPs and PPs were considered in the two previous works. Thirdly, we compare
our extracted dictionaries with three gold-standard dictionaries simultaneously and
investigate types of errors. Fourthly, we use the Świgra parser of Świdziński's grammar
and the EM algorithm to parse raw texts, whereas Przepiórkowski and Fast applied
a very simple regular grammar of 18 rules. We analyze fewer texts but we analyze
them more thoroughly, which means higher precision but not necessarily lower recall.
The final difference is that our test set covers twice as many verbs (201 lemmas) as
considered by Przepiórkowski and Fast.



The frame-based binomial hypothesis test (BHT; Brent, 1993) is assumed in this
work as a baseline against which our new ideas of filtering are compared, since it gave
the best results according to Fast and Przepiórkowski (2005). The authors reapplied
several known frame filtering methods: the BHT, the log-likelihood ratio test (LLR;
Gorrell, 1999; Sarkar and Zeman, 2000), and the maximum likelihood threshold (MLE;
Korhonen, 2002). Applying the one-stage BHT to our data, we obtain 26% recall
and 75% precision (F = 39%). To compare, the dictionary obtained by applying the
novel two-stage filtering of frames to the same counts of parses exhibits 32% recall
and 60% precision (F = 42%). The set-theoretic union of both dictionaries combines
their strengths and features F = 45%. These statistics relate to extracting whole
frames, whereas Przepiórkowski and Fast obtained similar values for the simpler task
of extracting only NPs and PPs. We find our results to be an encouraging signal that
similarities of valence frame sets should be exploited across different verbs as
much as possible, and also in an algorithmic way. The method introduced here allows
various extensions and modifications.
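The F-scores quoted above can be checked from the recall and precision figures. The sketch below assumes that F denotes the usual balanced harmonic mean of precision and recall, which the text does not spell out; the function name `f_score` is ours.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (assumed definition): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the text for the two filtering variants.
bht = f_score(precision=0.75, recall=0.26)        # one-stage BHT baseline
two_stage = f_score(precision=0.60, recall=0.32)  # two-stage filtering

print(f"BHT: {bht:.3f}, two-stage: {two_stage:.3f}")
```

Under this assumption both values round to the percentages given in the text. The F-score of the union dictionary cannot be recomputed this way, because its precision and recall are not reported here.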

The rest of this article describes our experiment in more detail. Section 2 provides a brief
introduction to co-occurrence matrices; Section 3 presents the verb valence
extraction procedure; the obtained dictionary is analyzed in Section 4; Section 5 contains
the conclusion. Three appendices follow the article. Appendix A gives additional details
for the co-occurrence matrix formalism; Appendix B describes the initial corpus parsing;
Appendix C introduces the EM selection algorithm.



2. The formalism of co-occurrence matrices 

Let us introduce the new description of valence frames which is applied to valence 
extraction in this paper. To begin with a more usual formal concept, consider a proto- 
typical entry from our gold-standard valence dictionary. It consists of the set of valence 
frames 

$$
F(\text{przyłapać}) = \left\{
\begin{array}{l}
\{\text{np(nom)}, \text{np(acc)}\}, \\
\{\text{np(nom)}, \text{np(acc)}, \text{na+np(loc)}\}, \\
\{\text{np(nom)}, \text{sie}, \text{na+np(loc)}\}
\end{array}
\right\} \tag{1}
$$

for the verb przyłapać (= to catch somebody red-handed). The symbol sie denotes the
reflexive marker się and na+np(loc) is a prepositional phrase with preposition na
(= on), which requires a nominal phrase in the locative case. The notations for cases
are as in the IPI PAN Corpus tagset: nom(inative), gen(itive), dat(ive), acc(usative),
inst(rumental), loc(ative), and voc(ative), cf. Przepiórkowski and Woliński (2003) or
http://korpus.pl/. For simplicity, it is assumed that no argument type can be
repeated in a single valence frame. This restriction can be overcome by assigning
unique identifiers to repetitions.

There are two subtleties which concern our implementation of notation (1) and are
worth exposing to avoid possible confusion later:

(i) We treat the reflexive marker się as an ordinary verb argument rather than as a part
of the verb lemma. The frames for a verb without się are merged with the frames
of its possible counterpart with się into one entry, unlike the traditional linguistic
analysis applied in Polish valence dictionaries. This affects all our counts of verb
entries in the following work. However, we do not combine entries for corresponding
perfective and imperfective verbs, which often take the same frames and occur in
almost complementary pairs, cf. Młynarczyk (2004).

(ii) A valence frame may lack the subject np(nom). According to the analysis applied in
Polish dictionaries, this lack is a counterpart of the English expletive it and it differs
syntactically from the dropped subject (denoted always as np(nom) in the valence frame
for a sentence). If a sentence lacks an overt subject, such a subject can or cannot
be inserted depending on the verb. Certain verbs do not subcategorize for a subject
at all, e.g. trzeba (= should) or brakować (= lack). Several other verbs often occur
without the subject but allow it in certain uses, such as padać (= fall/rain). The
valences of the second class of verbs are particularly hard to extract automatically
since Polish is a pro-drop language.

Summarising our remarks, there are many specific verbs such that sie or np(nom) (a) 
must appear in all their frames, (b) cannot appear in any frames, or (c) may be present 
or omitted, affecting the occurrence of other arguments. Similar interactions involving 
the reflexive marker and the subject have been studied in valence acquisition for other
languages (Mayol et al., 2005; Surdeanu et al., 2008).



Dębowski and Woliński (2007) proposed an approximate description of complex
interactions within the frame set $F(v)$ in terms of three simpler objects: the set of possible
arguments $L(v)$, the set of required arguments $E(v) \subset L(v)$, and the argument
co-occurrence matrix $M(v) : L(v) \times L(v) \to \{\leftarrow, \rightarrow, \leftrightarrow, \times, \perp\}$.
The definitions of the first two objects correspond to the following naming convention.
An argument is possible for $v$ if it appears in at least one frame and it is called
required for $v$ if it occurs in all frames. Thus we have

$$
L(v) := \bigcup_{f \in F(v)} f, \qquad E(v) := \bigcap_{f \in F(v)} f. \tag{2}
$$

For instance,

$$
L(\text{przyłapać}) = \{\text{np(nom)}, \text{np(acc)}, \text{sie}, \text{na+np(loc)}\}, \qquad
E(\text{przyłapać}) = \{\text{np(nom)}\}.
$$
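Equation (2) translates directly into set operations. The following minimal sketch recomputes the example for przyłapać; the function names and the encoding of frames as frozensets of argument labels are our own illustrative choices.

```python
# Frame set of Equation (1), each frame encoded as a frozenset of argument labels.
frames = [
    frozenset({"np(nom)", "np(acc)"}),
    frozenset({"np(nom)", "np(acc)", "na+np(loc)"}),
    frozenset({"np(nom)", "sie", "na+np(loc)"}),
]


def possible_arguments(frames):
    """L(v): arguments that appear in at least one frame (union of all frames)."""
    return frozenset().union(*frames)


def required_arguments(frames):
    """E(v): arguments that appear in every frame (intersection of all frames).

    Assumes at least one frame is given.
    """
    return frozenset.intersection(*frames)


print(sorted(possible_arguments(frames)))
print(sorted(required_arguments(frames)))
```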

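To define the co-occurrence matrix, let us denote the set of verb frames which contain
an argument type $a$ as $\llbracket a \rrbracket := \{f \in F(v) \mid a \in f\}$. Next, we will
introduce five implicitly verb-dependent relations:

$$
\begin{array}{lcll}
a \times b & \iff & \llbracket a \rrbracket \cap \llbracket b \rrbracket = \emptyset & \text{($a$ excludes $b$)}, \\
a \leftrightarrow b & \iff & \llbracket a \rrbracket = \llbracket b \rrbracket & \text{($a$ and $b$ co-occur)}, \\
a \rightarrow b & \iff & \llbracket a \rrbracket = \llbracket a \rrbracket \cap \llbracket b \rrbracket \neq \llbracket b \rrbracket & \text{($a$ implies $b$)}, \\
a \leftarrow b & \iff & \llbracket a \rrbracket \neq \llbracket a \rrbracket \cap \llbracket b \rrbracket = \llbracket b \rrbracket & \text{($b$ implies $a$)}, \\
a \perp b & \iff & \llbracket a \rrbracket \cap \llbracket b \rrbracket \notin \{\llbracket a \rrbracket, \llbracket b \rrbracket, \emptyset\} & \text{($a$ and $b$ are independent)}.
\end{array}
$$

Then the cells of matrix $M(v)$ are defined via the equivalence

$$
M(v)_{ab} := R \iff a\,R\,b \tag{3}
$$

for the verb arguments $a, b \in L(v)$. The symbol $\perp$ that denotes "formal" independence
was chosen intentionally to resemble the symbol $\perp\!\!\!\perp$, which is usually applied to denote
probabilistic independence.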

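For the discussed example we obtain:

$$
M(\text{przyłapać}) =
\begin{array}{l|cccc}
 & \text{np(nom)} & \text{np(acc)} & \text{sie} & \text{na+np(loc)} \\ \hline
\text{np(nom)} & \leftrightarrow & \leftarrow & \leftarrow & \leftarrow \\
\text{np(acc)} & \rightarrow & \leftrightarrow & \times & \perp \\
\text{sie} & \rightarrow & \times & \leftrightarrow & \rightarrow \\
\text{na+np(loc)} & \rightarrow & \perp & \leftarrow & \leftrightarrow
\end{array}
$$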
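The matrix for przyłapać can be recomputed mechanically from the frame set of Equation (1) and the five relation definitions. The sketch below is only an illustration; the ASCII relation labels ("x", "<->", "->", "<-", "_|_") and the function name `relation` are our own conventions.

```python
frames = {
    frozenset({"np(nom)", "np(acc)"}),
    frozenset({"np(nom)", "np(acc)", "na+np(loc)"}),
    frozenset({"np(nom)", "sie", "na+np(loc)"}),
}


def relation(frames, a, b):
    """Return the relation between arguments a and b induced by a frame set."""
    with_a = {f for f in frames if a in f}   # frames containing a
    with_b = {f for f in frames if b in f}   # frames containing b
    both = with_a & with_b
    if not both:
        return "x"      # a excludes b
    if with_a == with_b:
        return "<->"    # a and b co-occur
    if both == with_a:
        return "->"     # a implies b
    if both == with_b:
        return "<-"     # b implies a
    return "_|_"        # a and b are independent


arguments = ["np(nom)", "np(acc)", "sie", "na+np(loc)"]
matrix = {(a, b): relation(frames, a, b) for a in arguments for b in arguments}

for a in arguments:
    print(f"{a:12}", "  ".join(f"{matrix[a, b]:3}" for b in arguments))
```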



This unconventional approach to describing verb valences appears quite robust. For
example, consider an observed agreement score (cf. Artstein and Poesio, 2008) of the
co-occurrence matrix cells taken for the triples $(a, b, v)$ appearing simultaneously in two
compared dictionaries. Formally this agreement score equals

$$
A = \frac{\left|\{(a, b, v) \in T \mid M_1(v)_{ab} = M_2(v)_{ab}\}\right|}{|T|}, \tag{4}
$$

where $\{M_i(v) \mid v \in V_i\}$, $i = 1, 2$, are the two compared collections of co-occurrence
matrices and

$$
T = \{(a, b, v) \mid v \in V_1 \cap V_2,\ a, b \in L_1(v) \cap L_2(v)\}
$$

is the appropriate subset of triples $(a, b, v)$. The agreement scores (4) for the dictionaries
of Polański, Świdziński, and Bańko range from 86% to 89%, cf. Dębowski and Woliński
(2007).
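As an illustration of Equation (4), the following sketch computes the agreement score for two toy dictionaries. The data layout (a dict from verb to a dict of matrix cells keyed by argument pairs) and the toy entries are hypothetical; only the counting rule comes from the definition above.

```python
def agreement(matrices_1, matrices_2):
    """Observed agreement of matrix cells over shared verbs and shared arguments.

    Each input maps a verb to a dict {(a, b): relation}. The set of possible
    arguments of a verb is read off the keys of its matrix.
    """
    agree = total = 0
    for verb in matrices_1.keys() & matrices_2.keys():
        args_1 = {a for a, _ in matrices_1[verb]}
        args_2 = {a for a, _ in matrices_2[verb]}
        shared = args_1 & args_2
        for a in shared:
            for b in shared:
                total += 1
                agree += matrices_1[verb][a, b] == matrices_2[verb][a, b]
    return agree / total if total else float("nan")


# Hypothetical toy dictionaries that disagree on the off-diagonal cells of one verb.
dict_1 = {"v": {("a", "a"): "<->", ("a", "b"): "x", ("b", "a"): "x", ("b", "b"): "<->"}}
dict_2 = {"v": {("a", "a"): "<->", ("a", "b"): "_|_", ("b", "a"): "_|_", ("b", "b"): "<->"}}

score = agreement(dict_1, dict_2)
print(score)
```

Note that, following the definition of $T$ literally, the diagonal cells $(a, a, v)$ are counted as well.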

Dębowski and Woliński noticed also that the values of the matrix cells $M(v)_{ab}$ for
fixed arguments $a$ and $b$ tend not to depend on the verb $v$. The latter fact appears
favourable for automatic valence extraction. We may learn objects $L(v)$, $E(v)$, and
$M(v)$ separately with much higher accuracy and restore the set of frames $F(v)$ from
these by approximation. For example, consider the maximal set $\bar{F}(v) \subset 2^{L(v)}$ of frames
that contain all required arguments in $E(v)$ and induce the co-occurrence matrix $M(v)$.
Precisely,

$$
\bar{F}(v) := \left\{ f \in 2^{L(v)} \;\middle|\;
\begin{array}{l}
\forall_{a \in E(v)}\ a \in f, \\
\forall_{a, b \in L(v)}\ \phi(f, M(v), a, b)
\end{array}
\right\}, \tag{5}
$$

where

$$
\phi(f, \mu, a, b) :=
\begin{cases}
\neg (a \in f \wedge b \in f), & \mu_{ab} = \times, \\
a \in f \Leftrightarrow b \in f, & \mu_{ab} = \leftrightarrow, \\
a \in f \Rightarrow b \in f, & \mu_{ab} = \rightarrow, \\
a \in f \Leftarrow b \in f, & \mu_{ab} = \leftarrow, \\
\text{true}, & \mu_{ab} = \perp.
\end{cases}
$$



It is easy to see that $\bar{F}(v) \supset F(v)$. We have $\bar{F}(v) \neq F(v)$ for some verbs, as shown in
Subsection 4.3. In our application, however, the number of frames introduced by using
$\bar{F}(v)$ rather than $F(v)$ is small, see the last paragraph of Subsection 4.3. $\bar{F}(v)$ may be
used conveniently also for syntactic parsing of sentences. Typically, a grammar parser
checks whether a hypothetical frame $f$ of the parsed sentence belongs to the set $F(v)$,
defined by a valence dictionary linked to the parser. If $\bar{F}(v)$ rather than $F(v)$ is used for
parsing, which enlarges the set of accepted sentences, then there is no need to compute
$\bar{F}(v)$ in order to check whether $f \in \bar{F}(v)$. The parser can use a valence dictionary
which is stored just as the triple $(L(v), E(v), M(v))$. In our application, however, the
reconstructed set $\bar{F}(v)$ is needed explicitly for dictionary evaluation. Thus we provide
an efficient procedure to compute $\bar{F}(v)$ in Appendix A.
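Both uses of the reconstructed frame set can be sketched in a few lines. The code below is not the efficient procedure of Appendix A: it is a brute-force enumeration of all subsets of the possible arguments, assumed here purely for illustration, together with a membership test that needs only the stored triple. Function names and ASCII relation labels are ours.

```python
from itertools import combinations


def relation(frames, a, b):
    """Relation between a and b induced by a frame set (the five defined relations)."""
    with_a = {f for f in frames if a in f}
    with_b = {f for f in frames if b in f}
    both = with_a & with_b
    if not both:
        return "x"
    if with_a == with_b:
        return "<->"
    if both == with_a:
        return "->"
    if both == with_b:
        return "<-"
    return "_|_"


def phi(frame, rel, a, b):
    """Constraint imposed on a frame by one matrix cell."""
    in_a, in_b = a in frame, b in frame
    if rel == "x":
        return not (in_a and in_b)
    if rel == "<->":
        return in_a == in_b
    if rel == "->":
        return (not in_a) or in_b
    if rel == "<-":
        return (not in_b) or in_a
    return True  # "_|_": no constraint


def accepts(frame, possible, required, matrix):
    """Membership test for the reconstructed set, using only the stored triple."""
    frame = frozenset(frame)
    if not (required <= frame <= possible):
        return False
    return all(phi(frame, matrix[a, b], a, b) for a in possible for b in possible)


def reconstruct(possible, required, matrix):
    """Brute-force enumeration of all accepted frames (exponential in |possible|)."""
    args = sorted(possible)
    subsets = (frozenset(c) for k in range(len(args) + 1) for c in combinations(args, k))
    return {f for f in subsets if accepts(f, possible, required, matrix)}


frames = {
    frozenset({"np(nom)", "np(acc)"}),
    frozenset({"np(nom)", "np(acc)", "na+np(loc)"}),
    frozenset({"np(nom)", "sie", "na+np(loc)"}),
}
possible = frozenset().union(*frames)
required = frozenset.intersection(*frames)
matrix = {(a, b): relation(frames, a, b) for a in possible for b in possible}

f_bar = reconstruct(possible, required, matrix)
for f in sorted(f_bar, key=lambda f: (len(f), sorted(f))):
    print(sorted(f))
```

For this toy entry the reconstruction returns the three original frames plus two additional ones, {np(nom)} and {np(nom), na+np(loc)}, which illustrates how the reconstructed set can be a proper superset of the original one.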



3. The adjusted extraction procedure 

3.1. Overview 

Our valence extraction procedure consists of four distinct subtasks. 

Deep non-probabilistic parsing of corpus data: The first task was parsing
a part of the IPI PAN Corpus of Polish to obtain a bank of reduced parse forests, which
represent alternative valence frames for elementary clauses suggested by Świdziński's
grammar. The details of this procedure are described in Appendix B.

The obtained bank included 510 743 clauses which were decorated with reduced parse 
forests like the following two examples (correct reduced parses marked with a '+'): 

'Kto zastąpi piekarza?'
(= 'Who will replace the baker?')
   +zastąpić :np:acc: :np:nom:
    zastąpić :np:gen: :np:nom:

'Nie płakał na podium.'
(= 'He did not cry on the podium.')
    płakać :np:nom: :prepnp:na:acc:
   +płakać :np:nom: :prepnp:na:loc:

Reduced parses are intended to be the alternative valence frames for a clause plus 
the lemma of the verb. In contrast to full parses of sentences, reduced parses are 
highly repeatable in the corpus data. Thus, unsupervised learning can be used to find 
approximate counts of correct parses in the reduced parse forests and to select the best 
description for a given sentence on the basis of its frequency in the whole bank. 

EM disambiguation of reduced parse forests: In the second subtask, the reduced
parse forests in the bank were disambiguated to single valence frames per
clause. It is a standard approach to disambiguate full parse forests with a probabilistic
context-free grammar (PCFG). However, reformulating Świdziński's metamorphosis
grammar as a pure CFG and the subsequent unsupervised (for the lack of a treebank)
PCFG training would take too much work for our purposes. Thus we have disambiguated
reduced parse forests by means of the EM selection algorithm introduced in Appendix C.
Let $A_i$ be the set of reduced parse trees for the $i$-th sentence in the bank, $i = 1, 2, \ldots, M$.
We set the initial $p_j^{(1)} = 1$ and applied the iteration (11)-(12) from Appendix C until
$n = 10$. Then one of the shortest parses with the largest conditional probability $p_{ji}^{(n)}$
was sampled at random.
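The exact update equations are given in Appendix C and are not repeated in this section. The sketch below therefore rests on an assumption: that each iteration renormalizes the current weights within every forest and then averages these conditional probabilities over all $M$ forests. The toy forests, the string encoding of reduced parses, and the measure of parse length (number of whitespace-separated tokens) are hypothetical as well.

```python
import random


def em_selection(forests, n_final=10):
    """Frequency-based EM over alternatives (assumed form of the update).

    forests: list of non-empty sets of alternatives (reduced parses as strings).
    Returns the weights p_j after iterating from p^(1) = 1 up to p^(n_final).
    """
    parses = {j for forest in forests for j in forest}
    p = {j: 1.0 for j in parses}                      # initial p_j^(1) = 1
    for _ in range(n_final - 1):
        expected = dict.fromkeys(parses, 0.0)
        for forest in forests:
            z = sum(p[j] for j in forest)
            for j in forest:
                expected[j] += p[j] / z               # conditional probability in this forest
        p = {j: s / len(forests) for j, s in expected.items()}
    return p


def select(forest, p, rng=random):
    """Pick one of the shortest parses among those with the largest conditional probability."""
    z = sum(p[j] for j in forest)
    cond = {j: p[j] / z for j in forest}
    best = max(cond.values())
    top = [j for j in forest if cond[j] == best]
    shortest = min(len(j.split()) for j in top)
    return rng.choice(sorted(j for j in top if len(j.split()) == shortest))


# Hypothetical toy bank: the first clause is ambiguous, the other two are not.
forests = [
    {"zastąpić :np:acc: :np:nom:", "zastąpić :np:gen: :np:nom:"},
    {"zastąpić :np:acc: :np:nom:"},
    {"zastąpić :np:acc: :np:nom:"},
]
weights = em_selection(forests)
print(select(forests[0], weights))
```

The point of the sketch is that the procedure never inspects the internal structure of an alternative during estimation: only how often each alternative recurs across forests matters.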

To assess the quality of this disambiguation, we prepared a test set of 190
sentences with the correct reduced parses indicated manually. Since the output of our
disambiguation procedure is stochastic and the test set was small, we performed 500
Monte Carlo simulations on the whole test set. Our procedure chose the correct reduced
parse for 72.6% of sentences on average. Increasing the number of the EM iterations to
$n = 20$ did not improve this result. As a comparison, sampling simply a parse $j$ with the
largest $p_{ji}^{(n)}$ yielded an accuracy of 72.4%, sampling a parse with the minimal length was
accurate in 57.5% of cases, whereas blind sampling (assuming equidistribution) achieved
46.9%. The difference between 72.6% and 72.4% is not significant but, given that it
does not spoil our results, we prefer using shorter parses.

Computing the preliminary dictionary from parses: Once the reduced parse 
forests in the bank had been disambiguated, a frequency table of the disambiguated 
reduced parses was computed. This will be referred to as the preliminary valence 
dictionary. The entries in this dictionary looked like this: 

'przyłapać' => {
    'np(acc),np(gen),np(nom)' => 1,
  + 'na+np(loc),np(nom),sie' => 1,
    'na+np(loc),np(gen),np(nom)' => 1,
  + 'np(acc),np(nom)' => 4,
    'adv,np(nom)' => 1,
  + 'na+np(loc),np(acc),np(nom)' => 3
}

The numbers are the obtained reduced parse frequencies, whereas the correct valence
frames are marked with a '+', cf. (1). Notice that the counts for each parse are low. We
chose a low frequency verb for this example to make it short. Another natural method
to obtain a preliminary dictionary was to use the coefficients $M p_j^{(n)}$ as the frequencies of
frames. This method yields final results that are 1% worse than for the dictionary based
on the frequency table.

Filtering of the preliminary dictionary: The preliminary dictionary contains 
many incorrect frames, which are due to parsing or disambiguation errors. In the last 
subtask, we filtered this dictionary using supervised learning, as done commonly in 
related work. 

For example, the BHT filtering by Brent (1993) is as follows. Let $c(v, f)$ be the count
of reduced parses in the preliminary dictionary that contain both verb $v$ and valence
frame $f$. Denote the frequency of verb $v$ as $c(v) = \sum_f c(v, f)$. Frame $f$ is retained in
the set of valence frames $F(v)$ if and only if

$$
\sum_{n = c(v, f)}^{c(v)} \binom{c(v)}{n} p_f^{\,n} (1 - p_f)^{c(v) - n} < \alpha, \tag{6}
$$

where $\alpha = 0.05$ is the usual significance level and $p_f$ is a frequency threshold. The
parameter $p_f$ is selected as a value for which the classification rule (6) yields the
minimal error rate against the training dictionary. In the idealized language of statistical
hypothesis testing, $p_f$ equals the empirical relative frequency of frame $f$ for the verbs
that do not select for $f$ according to the ideal dictionary.
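The binomial hypothesis test can be sketched as an upper-tail binomial sum, which is the standard form of Brent's test. The counts below are taken from the przyłapać example, where the verb occurs 11 times in total, while the threshold `p_f = 0.05` is a hypothetical value chosen only for illustration.

```python
from math import comb


def binomial_tail(n, k, p):
    """Probability of at least k successes in n Bernoulli(p) trials."""
    return sum(comb(n, m) * p**m * (1 - p) ** (n - m) for m in range(k, n + 1))


def bht_keep(c_vf, c_v, p_f, alpha=0.05):
    """Retain frame f for verb v if the observed count is unlikely to be noise."""
    return binomial_tail(c_v, c_vf, p_f) < alpha


c_v = 1 + 1 + 1 + 4 + 1 + 3   # total count of przyłapać in the preliminary dictionary
print(bht_keep(c_vf=4, c_v=c_v, p_f=0.05))   # frame seen 4 times
print(bht_keep(c_vf=1, c_v=c_v, p_f=0.05))   # frame seen once
```

With such a low verb count, every frame observed only once fails the test for this threshold, including the correct frame with the reflexive marker, which shows why low-frequency verbs are hard for frame-based filtering.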

We have used the BHT as the baseline against which we have tested a new procedure
of frame filtering. The new procedure applied the co-occurrence matrices presented in
Section 2. It was as follows:

1. Compute $L(v)$ and $E(v)$ via Equation (2) from the sets of valence frames $F(v)$ given
by the preliminary dictionary.

2. Correct $L(v)$ and $E(v)$ using the training dictionary.

3. Reconstruct $F(v)$ given the new $L(v)$ and $E(v)$. This reconstruction is defined as
the substitution $F(v) \leftarrow \{(f \cup E(v)) \cap L(v) \mid f \in F(v)\}$.

4. Compute $M(v)$ from $F(v)$ via Equation (3).

5. Correct $M(v)$ using the training dictionary.

6. Reconstruct $F(v)$ given the new $M(v)$. This reconstruction consists of the substitution
$F(v) \leftarrow \bar{F}(v)$, where $\bar{F}(v)$ is defined in Equation (5) and computed via the procedure
described in Appendix A.

7. Output $F(v)$ as the valence of verb $v$.

Steps 2 and 5 are described in Subsections 3.2 and 3.3, respectively.
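Step 3 of the list above is a one-line set transformation. The sketch below applies it to the preliminary entry for przyłapać; the corrected sets `possible` and `required` are hypothetical outputs of Step 2, chosen here to match the gold-standard entry.

```python
def rebuild_frames(frames, possible, required):
    """Step 3: add every required argument to each frame, then drop arguments
    that are no longer possible."""
    return {(frozenset(f) | required) & possible for f in frames}


# Frames observed for przyłapać in the preliminary dictionary.
observed = [
    {"np(acc)", "np(gen)", "np(nom)"},
    {"na+np(loc)", "np(nom)", "sie"},
    {"na+np(loc)", "np(gen)", "np(nom)"},
    {"np(acc)", "np(nom)"},
    {"adv", "np(nom)"},
    {"na+np(loc)", "np(acc)", "np(nom)"},
]

# Hypothetical result of Step 2: np(gen) and adv were filtered out.
possible = frozenset({"np(nom)", "np(acc)", "sie", "na+np(loc)"})
required = frozenset({"np(nom)"})

rebuilt = rebuild_frames(observed, possible, required)
for f in sorted(rebuilt, key=lambda f: (len(f), sorted(f))):
    print(sorted(f))
```

Removing the filtered arguments merges some observed frames and leaves shortened ones such as {np(nom)}, which the matrix correction in Steps 4 to 6 may or may not eliminate.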

In our experiment, the training dictionary consisted of valence frames for 832 verbs
from the dictionary of Świdziński (1994). It contained all verbs in Świdziński's dictionary
except those included in the test set introduced in Section 4.



3.2. Filtering of the argument sets 



For simplicity of computation, the correction of the argument sets $L(v)$ and $E(v)$ was done
by setting thresholds for the frequency of arguments, as in the maximum likelihood
thresholding test for frames (MLE) proposed by Korhonen (2002). Thus a possible
argument $a$ for verb $v$ was retained if it accounted for a certain proportion of the verb's
frames in the corpus. Namely, $a$ was kept in $L(v)$ if and only if

$$
c(v, a) > p_a c(v) + 1, \tag{7}
$$

where $c(v)$ is the frequency of reduced parses in the preliminary dictionary that contain
$v$, as in (6), and $c(v, a)$ is the frequency of parses that contain both $v$ and $a$. Parameter $p_a$
was estimated as dependent on the argument but independent of the verb. The optimal
$p_a$ was selected as a value for which the classification rule (7) yielded the minimal error
rate against the training dictionary.

The difference between the BHT and the MLE is negligible if the count of the verb
$c(v)$ and the frequency threshold $p_a$ are big enough. This condition is not always satisfied
in our application but we preferred the MLE for its computational simplicity and because
it does not require choosing an appropriate significance level $\alpha$. In a preceding
subexperiment, we had also tried out the more general model $c(v, a) > p_a c(v) + t_a$ instead of (7),
where $t_a$ was left to vary. Since $t_a = 1$ was learned for the vast majority of $a$'s, we set
constant $t_a = 1$ for all verb arguments later.

Since the same error rate could be obtained for many different values of $p_a$, we applied
a discrete minimization procedure to avoid overtraining and excessive searching. Firstly,
the resolution level $N := 10$ was initialized. In the following loop, we checked the error
rate for each $p_a := n/N$, $n = 0, 1, \ldots, N$. The number of distinct $p_a$'s yielding the
minimal error rate was determined and called the degeneration $D(N)$. For $D(N) < 10$,
the loop was repeated with $N := 10N$. In the other case, the optimal $p_a$ was returned
as the median of the $D(N)$ distinct values that allowed the minimal error rate. Selecting
the median was inspired by the maximum-margin hyperplanes used in support vector
machines to minimize overtraining (Vapnik, 1995).
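The discrete minimization loop can be sketched as follows. The error function is passed in as a callable; the toy error function and the `max_resolution` safeguard (which stops the refinement when fewer than 10 minimizers ever appear) are our own additions and are not part of the described procedure.

```python
from statistics import median


def optimal_threshold(error_rate, min_degeneration=10, max_resolution=10**6):
    """Grid search for a threshold in [0, 1] with successive tenfold refinement.

    error_rate: callable mapping a candidate threshold to its error count.
    Returns the median of the grid points that attain the minimal error.
    """
    resolution = 10
    while True:
        candidates = [n / resolution for n in range(resolution + 1)]
        errors = [error_rate(p) for p in candidates]
        best = min(errors)
        minimizers = [p for p, e in zip(candidates, errors) if e == best]
        # Enough distinct minimizers (the "degeneration"), or the assumed safeguard hit.
        if len(minimizers) >= min_degeneration or resolution >= max_resolution:
            return median(minimizers)
        resolution *= 10


def toy_error(p):
    """Hypothetical error function with a flat minimum on [0.2, 0.4]."""
    return 0 if 0.2 <= p <= 0.4 else 1


p_opt = optimal_threshold(toy_error)
print(p_opt)
```

In the toy case the first grid finds only three minimizers, so the resolution is refined once and the median of the 21 minimizers on the finer grid, the middle of the flat region, is returned.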



Similar supervised learning was used to determine whether a given argument is
strictly compulsory for a verb. By symmetry, an argument $a$ that was found possible
with verb $v$ was considered as required unless it was rare enough. Namely, $a \in L(v)$ was
included in the new $E(v)$ unless

$$
c(v) - c(v, a) > p_{\neg a} c(v) + 1, \tag{8}
$$

where $p_{\neg a}$ was another parameter, estimated analogously to $p_a$.
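Rules (7) and (8) are simple threshold tests on counts. The sketch below applies them to argument counts derived from the preliminary entry for przyłapać, where the verb occurs 11 times; the threshold values passed in the example calls are hypothetical.

```python
def is_possible(c_va, c_v, p_a):
    """Rule (7): keep argument a for verb v if it is frequent enough."""
    return c_va > p_a * c_v + 1


def is_required(c_va, c_v, p_not_a):
    """Rule (8): a possible argument is required unless its absence is frequent enough."""
    return not (c_v - c_va > p_not_a * c_v + 1)


c_v = 11                       # total count of przyłapać
counts = {                     # parses of przyłapać containing each argument
    "np(nom)": 11,
    "np(acc)": 8,
    "na+np(loc)": 5,
    "np(gen)": 2,
    "sie": 1,
    "adv": 1,
}

for arg, c_va in counts.items():
    print(f"{arg:12} possible={is_possible(c_va, c_v, p_a=0.05)}")
```

With these illustrative thresholds the single occurrence of sie is rejected although the frame is correct, while the spurious np(gen) passes, which reflects the difficulty of low-frequency verbs rather than a property of the rule itself.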



3.3. Correction of the co-occurrence matrices 



Once we had corrected the argument sets in the preliminary dictionary, the respective 
co-occurrence matrices still contained some errors when compared with the training 
dictionary. However, the number of those errors was relatively small and it was not so 
trivial to propose an efficient scheme for their correction. 

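A possible approach to such correction is to develop statistical tests with clear null
hypotheses that would detect structural zeroes in contingency tables

$$
\begin{array}{c|cc}
 & a \notin f & a \in f \\ \hline
b \notin f & N - N_a - N_b + N_{ab} & N_a - N_{ab} \\
b \in f & N_b - N_{ab} & N_{ab}
\end{array}
$$

where $N = |F(v)|$, $N_a = |\llbracket a \rrbracket|$, $N_b = |\llbracket b \rrbracket|$, and
$N_{ab} = |\llbracket a \rrbracket \cap \llbracket b \rrbracket|$ are the counts of all frames of $v$,
of the frames containing $a$, of the frames containing $b$, and of the frames containing both.
Relations $\leftarrow$, $\rightarrow$, and $\times$ correspond to particular configurations of structural
zeroes in these tables.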

Constructing structural zero detection tests appeared to be difficult under the 
common-sense requirement that the application of these tests cannot diminish the 
agreement score (4) between the corrected dictionary and the training dictionary. We
have experimented with several such schemes but they did not pass the aforementioned
criterion empirically. Eventually, we have discovered successful correction methods
which rely on the fact that values of matrix cells for fixed arguments tend not to
depend on a verb, see Section 2.

In this paper we compare three such correction methods. Let us denote the value
of a cell $M(v)_{ab}$ after Step 4 as $S$. On the other hand, let $R$ be the most frequent
relation for arguments $a$ and $b$ given by the training dictionary across different verbs.
We considered the following correction schemes:

(A) $M(v)_{ab}$ is left unchanged (the baseline): $M(v)_{ab} \leftarrow S$.

(B) $M(v)_{ab}$ becomes verb-independent: $M(v)_{ab} \leftarrow R$.

(C) We use the most prevalent value only if there is enough evidence for a
verb-independent interaction:

$$
M(v)_{ab} \leftarrow
\begin{cases}
R, & C(aRb) > p_{S \Rightarrow R}\, C(a, b) + t_{S \Rightarrow R}, \\
S, & \text{else},
\end{cases} \tag{9}
$$

where $C(aRb)$ is the number of verbs for which $aRb$ is satisfied and $C(a, b)$ is
the number of verbs that take both $a$ and $b$; both numbers relate to the training
dictionary. Coefficients $p_{S \Rightarrow R}$ and $t_{S \Rightarrow R}$ are selected as the values for which rule (9)
returns the maximal agreement score (4) against the training dictionary.

There were only a few relation pairs $S \Rightarrow R$ for which method (C) performed
substitutions $M(v)_{ab} \leftarrow R$ when applied to our data. These were: $\leftarrow \Rightarrow \times$,
$\rightarrow \Rightarrow \times$, $\perp \Rightarrow \leftarrow$, $\perp \Rightarrow \rightarrow$, and $\perp \Rightarrow \times$.
Unlike the case of argument filtering, the optimal $t_{S \Rightarrow R}$ was equal
to 1 only for one relation pair, namely $\perp \Rightarrow \times$. The evaluation of methods (A), (B) and
(C) against an appropriate test set is presented in Section 4.3.
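Correction scheme (C) can be sketched as follows. The data layout (training matrices as dicts keyed by argument pairs), the ASCII relation labels, the toy training data, and the parameter values are hypothetical; leaving a cell unchanged when no parameters were learned for the pair of relations is also our assumption.

```python
from collections import Counter, defaultdict


def relation_counts(training):
    """For every argument pair, count how many training verbs show each relation."""
    counts = defaultdict(Counter)
    for matrix in training.values():
        for pair, rel in matrix.items():
            counts[pair][rel] += 1
    return counts


def correct_cell(s, pair, counts, params):
    """Scheme (C): replace the observed relation s by the prevalent relation R
    only if R is supported by enough training verbs.

    params maps (S, R) to the learned coefficients (p, t).
    """
    if pair not in counts:
        return s
    r, c_rel = counts[pair].most_common(1)[0]   # ties are broken arbitrarily
    if (s, r) not in params:
        return s
    p, t = params[(s, r)]
    c_pair = sum(counts[pair].values())          # verbs taking both arguments
    return r if c_rel > p * c_pair + t else s


pair = ("np(acc)", "sie")
training = {
    "v1": {pair: "x"},
    "v2": {pair: "x"},
    "v3": {pair: "x"},
    "v4": {pair: "_|_"},
}
counts = relation_counts(training)
params = {("_|_", "x"): (0.25, 1)}               # hypothetical coefficients

print(correct_cell("_|_", pair, counts, params))
```

Scheme (B) corresponds to always returning the prevalent relation, and scheme (A) to always returning the observed one.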



4. Evaluation of the dictionary 

4.1. Overview 

Having applied the procedures described in Section 3, we obtained an automatically
extracted valence dictionary that included 5443 verb entries after Step 6, which is
five times more than in Świdziński (1994). As mentioned in the previous section, all
parameters were trained on frame sets provided by Świdziński (1994) for 832 verbs. In
contrast, the valence frames in our test set were simultaneously given by Świdziński
(1994), Bańko (2000), and Polański (1992) for 201 verbs different from the training
verbs. Except for 5 verbs missing in Polański and one missing in Bańko, each verb in
the test set was described by all dictionaries and we kept track of which dictionary
contributed which frame.

We preferred to compare the automatically extracted dictionary with three reference
dictionaries at once to sort out possible mistakes in them. In particular, the majority
voting (MV) of the three dictionaries was also considered. The verbs for the test set were
selected by hand for the following reasons: Firstly, each reference dictionary contained
a different set of verbs in its full version. Secondly, entries from the dictionaries by
Bańko and Polański had to be typed into the computer manually and interpreted by an
expert since these authors often described arguments abstractly, like the "adverbial of
time/direction/cause/degree", rather than as NPs, PPs or adverbs. Thirdly, verbs taking
rare arguments were intentionally overrepresented in our test set. Although we could
not enlarge or alter the test set easily to perform reasonable n-fold cross-validation,
the variation of scores can be seen by comparing different automatically extracted
dictionaries with different gold-standard dictionaries. We find this more informative
for future research than the standard cross-validation.

Table I. The evaluation of argument filtering.

POSSIBLE        p_a     P    GSP   FN   FP    E
np(nom)         0.0?   199   201    2    0    2
np(acc)         0.0?   126   142   25    9   34
sie             0.0?    71    96   29    4   33
np(dat)         0.02     ?     ?   26   11   37
np(inst)        0.04    39    61   31    9   40
ZE              0.1?    26    54   30    2   32
adv             0.1?     ?     ?    ?    ?    ?
do+np(gen)      0.07     ?     ?   25    4   29
na+np(acc)      0.0?    17    41   25    1   26
PZ              0.0?     ?    31    ?    0    ?
w+np(loc)       0.?4     1    30   30    1   31
z+np(inst)      0.0?     8    28   20    0   20
BY              0.14     4    28   26    2   28
inf             0.1     14    27   13    0   13
np(gen)         0.31     8    24   17    1   18
z+np(gen)       0.08     7    23   19    3   22
w+np(acc)       0.0?     8    19   14    3   17
o+np(loc)       0.03    11    19    8    0    8
za+np(acc)      0.03     3    17   15    1   16
od+np(gen)      0.1      2    17   15    0   15
o+np(acc)       0.01    13    16    6    3    9
adj(nom)        0.77     1     3    2    0    2

NOT REQUIRED    p_¬a    P    GSP   FN   FP    E
np(nom)         0.54     3    19   19    3   22
np(acc)         0.24   174   174   10   10   20
sie             0.12   186   188    5    3    8
do+np(gen)      0.04   201   199    0    2    2
inf             0.13   199   199    0    0    0
np(dat)         0.02   201   199    0    2    2

The evaluation is divided into three parts. We analyze some specific errors of our 
two-stage approach, each stage assessed separately. In the following, we relate our results 
to previous research. 

4.2. Analysis of the argument filtering 



Table I presents the results for parameters $p_a$ and $p_{\neg a}$ tested solely on Świdziński (1994)
for the 201 test verbs. The notations in the column titles are: P - the number of positive
outcomes in the automatically extracted dictionary after Step 3 of dictionary filtering
(one outcome is one verb taking the argument), GSP - the number of gold-standard
positive outcomes in Świdziński (GSP = P - FP + FN), FN - the number of false
negatives, FP - the number of false positives, and E - the number of errors (E =
FN + FP). All of FN, FP, GSP, P, and E lie between 0 and 201. The notations for certain arguments
in the table rows are: sie - the reflexive marker się, x+np(y) - the prepositional phrase
introduced by preposition x requiring a noun in case y, ZE - the clause introduced
by że (= that), PZ - the clause introduced by czy (= whether), and BY - the clause
introduced by żeby (= so as to).
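
The identities between the column values can be checked mechanically. The following is a minimal sketch, not part of the original experiment: the function name `argument_counts` is hypothetical, and the per-argument precision and recall are derived quantities that the table itself does not report. The example uses the legible np(gen) row of Table I (P = 8, FN = 17, FP = 1).

```python
def argument_counts(p, fn, fp):
    """Derive GSP and E from P, FN and FP using the identities stated in the text.

    Precision and recall are added for illustration only (an assumption of this
    sketch); Table I does not list them per argument.
    """
    gsp = p - fp + fn          # GSP = P - FP + FN
    errors = fn + fp           # E = FN + FP
    true_pos = p - fp          # positive outcomes confirmed by the gold standard
    precision = true_pos / p if p else float("nan")
    recall = true_pos / gsp if gsp else float("nan")
    return {"GSP": gsp, "E": errors, "precision": precision, "recall": recall}


# np(gen) row of Table I: P = 8, FN = 17, FP = 1
row = argument_counts(p=8, fn=17, fp=1)
print(row["GSP"], row["E"])  # 24 18, as listed in the table
```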

Although the overall precision of single argument extraction is high (it reaches 89%,
see the (verb, argument) scores in Table II below), all numerical values for this task
depend heavily on the type of extracted argument. The case of frequency thresholds $p_a$,
being in the range of 0.02-0.77, is notable. These thresholds are higher for arguments 
that can be used as NP modifiers, e.g. adj(nom) and np(gen), or verbal adjuncts, e.g. 
adv and w+np(loc). In general, the errors concentrate on low-frequency arguments. 
That occurs probably because the frequency of tokens coming from parsing errors does 
not depend systematically on the argument type. Thus this frequency dominates the 
frequency of tokens coming from well parsed sentences for low-frequency types. Except 
for the extraction of a direct object np(acc) and adverbial phrase adv, gold-standard 
positive outcomes (GSP) outnumber the positive ones (P). Put differently, false positives
(FP) are fewer than false negatives (FN), although the learning objective was set to
minimize the error rate (E = FP + FN). The same phenomenon appears in Brent (1993).

We have also noticed that the extracted valences are better for less frequent verbs. We 
can see several reasons for this. Firstly, there are more types of infrequent verbs than of 
frequent ones, so thresholds pa get more adjusted to the behaviour of less frequent verbs. 
Secondly, the description of infrequent verb valences given by the training dictionary 
is less detailed. In particular, the gold-standard dictionary fails to cover less frequent 
arguments that are harder to extract. Unfortunately, the small size of our training and 
test data does not enable efficient exploration of how thresholds pa could depend on 
the frequency of the verb. According to Table I, about half of the argument types were 
acknowledged in the test data for just a few verbs. 

The arguments that we found particularly hard to extract are the adverbs (adv), 
with inequality P > GSP, and a group of arguments with P much smaller than GSP. 
The latter include several adjunct-like prepositional phrases (e.g., w+np(loc), w means 
in), certain clauses (PZ and BY), and the possible lack of subject np(nom) (= non- 
required np(nom)), which corresponds roughly to the English expletive it. The inequality 
P > GSP for adverbs probably reflects their inconsistent recognition as verb arguments 
in the gold standard. 

The climbing of clitics and objects was another important problem that we came
across when we studied concrete false positives. Namely, some arguments of the Polish
infinitive phrase required by a finite verb can be placed anywhere in the sentence. In
contrast to Romance languages, this phenomenon concerns not only clitics. Unfortunately,
Świdziński's grammar does not model either object or clitic climbing and this
could have caused the following FPs:

- 4 of 9 outcomes for np(acc): kazać (= order), móc (= may), musieć (= must),
starać (się) (= make efforts).

- 3 of 11 outcomes for np(dat): móc, pragnąć (= desire/wish), starać (się).

There were no FPs that could be attributed to the climbing of the reflexive marker się,
although this clitic climbs most often. For no clear reason, the optimal threshold $p_a$ for
się was much higher for the training dictionary than for the test dictionary.

These three frequent arguments also featured relatively many FPs that were due to
omissions in the test dictionary:

- 1 of 9 outcomes for np(acc): skarżyć (= accuse),

- all outcomes for się: pogorszyć (= make worse), przyzwyczajać (= get used),
wylewać (= pour out), związać (= bind),

- 6 of 11 outcomes for np(dat): ciec (= flow), dostosować (= adjust), drżeć (= thrill),
dźwigać (= carry), ratować (= save), wsadzić (= put into).

As we can see, almost all FPs for these arguments are connected either to clitic and 
object climbing or to omissions in the test set. There is room for substantial improvement 
both in the initial corpus parsing and in the test dictionaries. 

4.3. Evaluation of the co-occurrence matrix adjustment 

We obtained the following agreement scores for the three methods of co-occurrence
matrix adjustment defined in Section 3.3:

agreement score 



method (A) — no adjustment (baseline) 
method (B) — verb-independent matrices 
method (C) — a combination of those 



77% 



The scores are statistics Jl} computed on the 201 test verbs for the dictionary of 



Świdziński (1994) and the preliminary dictionary processed until Step 6. Method (C)
gave the best results, so it is the only method considered subsequently.

In more detail, Table II presents scores for all manually compiled dictionaries and the
automatically extracted dictionary at several stages of filtering: AE is the preliminary
dictionary, AE-A is the dictionary after correcting the argument sets (Step 3), AE-C is
the one where co-occurrence matrices were corrected using method (C) (Step 6), and
AE-F is the baseline filtered only with the frame-based binomial hypothesis test (6). We
have constructed several dictionaries derived from these, such as set-theoretic unions,
intersections, or majority voting, but present only the best result, the AE-C+F, which
is the union of frames from the two-stage filtered AE-C and the one-stage filtered AE-F.
The displayed MV is the majority voting of Bańko, Polański, and Świdziński, which are
denoted as Bań., Pol., and Świ.

Each cell of the two triangular sections of Table II presents the number of pairs,
(verb, frame) or (verb, argument), that appear simultaneously in the two dictionaries
specified by the row and column titles, counted for the 201 test verbs. The displayed
recall, precision, and F-score were computed against the MV dictionary. Recall and
precision against other dictionaries can be computed from the numbers given in the
triangular sections.
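
As a worked illustration of how the scores follow from the triangular sections, here is a minimal sketch. It assumes that the F-score is the harmonic mean of precision and recall, which agrees with the reported figures; the function name `recall_precision_f` is hypothetical. The counts are the (verb, frame) entries for AE-C: 394 pairs shared with MV, 658 pairs in AE-C, and 1218 pairs in MV.

```python
def recall_precision_f(overlap, n_extracted, n_gold):
    """Scores of an extracted dictionary against a reference dictionary.

    overlap      -- number of pairs present in both dictionaries
    n_extracted  -- number of pairs in the evaluated dictionary (diagonal cell)
    n_gold       -- number of pairs in the reference dictionary (diagonal cell)
    """
    recall = overlap / n_gold
    precision = overlap / n_extracted
    # Assumption of this sketch: F is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score


# (verb, frame) pairs: AE-C against MV
r, p, f = recall_precision_f(overlap=394, n_extracted=658, n_gold=1218)
print(f"{r:.2f} {p:.2f} {f:.2f}")  # prints 0.32 0.60 0.42
```

The same function applied to the counts for AE-C+F (441, 746, 1218) gives the F-score of 0.45 quoted in the text.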

Although a large variation of precision and recall can be observed in Table II, the
F-scores do not vary so much. Assuming the F-score as an objective to be maximized,
the two-stage filtering is better than the frame-based BHT. Namely, we have F = 42%
for the AE-C whereas F = 39% for the AE-F, the scores referring to pairs (verb, frame).
The set-theoretic union of both dictionaries, AE-C+F, exhibits an even larger F = 45%.
In the case of the dictionaries that are not displayed, we have observed the following triples of
recall/precision/F-score: (a) 20%/81%/32% for the intersection of AE-A, AE-C, and
AE-F, (b) 33%/61%/43% for their majority voting, (c) 39%/45%/42% for their union,
and (d) 39%/46%/42% for the union of just AE-A and AE-F.

The precision of both AE-C and AE-F with respect to the MV is equal to or higher 
than that of manually edited dictionaries, whether we look at single arguments or at 
frames. A word of caution is in order, however. Very high precision against the MV test 
dictionary, provided the recall is sufficient, is a desirable feature of the automatically 
extracted dictionary. The converse should be expected for the contributing sources of the 
MV dictionary. These should be favoured for presenting frames not occurring in other 
sources provided all frames are true. Formally, the contributing sources should feature 
very high recall and relatively lower precision against their MV aggregate. Exactly this
can be observed in Table II.

In general, through the correction of co-occurrence matrices in Step 5 and the frame 
reconstruction more frames are deleted from the AE-A dictionary than added. The 
AE-A contains 338 pairs (verb, frame) which do not appear in the obtained AE-C 
dictionary, whereas only 13 such pairs from the AE-C are missing in the AE-A. The 
sets of pairs (verb, argument) are almost the same for both dictionaries. 

A problem that is buried in the apparently good-looking statistics is the actual shape
of co-occurrence matrices in the AE-C dictionary. In Step 5 of dictionary filtering, many
matrix cells are reset as independent of the verb. This affects verbs such as dziwić
(= surprise/wonder). The correct set of frames for this verb is close to

\[
F(\text{dziwić}) = \left\{
\begin{array}{l}
\{\text{np(nom)}, \text{np(acc)}\}, \\
\{\text{ZE}, \text{np(acc)}\}, \\
\{\text{np(nom)}, \text{sie}\}, \\
\{\text{np(nom)}, \text{sie}, \text{np(dat)}\}, \\
\{\text{np(nom)}, \text{sie}, \text{ZE}\}
\end{array}
\right\}. \tag{10}
\]

The subordinate clause ZE excludes subject np(nom) when się is missing but it excludes
direct object np(acc) when się is present (for there is a reflexive diathesis,
dziwić się = be surprised).

The reconstruction (5) does not recover the frame set (10) properly for two reasons.
Firstly, clause ZE excludes np(acc) and implies np(nom) for the majority of verbs.
Secondly, the co-occurrence matrix formalism cannot model any pairwise exclusion that
is conditioned on the absence or presence of another argument. However, we suppose
that such an argument interaction is very rare and this deficiency is not so important
en masse.
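
The conditional exclusion described for dziwić can be made concrete with a small check. This is only an illustrative sketch: the helper names `cooccur` and `excluded_given` are hypothetical, and the simple "do two arguments ever share a frame" test is a simplification assumed here, not the co-occurrence matrix defined earlier in the paper.

```python
# The approximate frame set (10) for "dziwić"; "sie" stands for the reflexive marker.
FRAMES = [
    {"np(nom)", "np(acc)"},
    {"ZE", "np(acc)"},
    {"np(nom)", "sie"},
    {"np(nom)", "sie", "np(dat)"},
    {"np(nom)", "sie", "ZE"},
]


def cooccur(frames, a, b):
    """True if arguments a and b appear together in at least one frame."""
    return any(a in f and b in f for f in frames)


def excluded_given(frames, a, b, cond, present):
    """True if a and b never share a frame among the frames in which the
    argument `cond` is present (present=True) or absent (present=False)."""
    subset = [f for f in frames if (cond in f) == present]
    return not cooccur(subset, a, b)


# Over the whole frame set, ZE is compatible with both np(nom) and np(acc) ...
print(cooccur(FRAMES, "ZE", "np(nom)"), cooccur(FRAMES, "ZE", "np(acc)"))
# ... yet each pair is excluded once the presence of "sie" is taken into account.
print(excluded_given(FRAMES, "ZE", "np(nom)", cond="sie", present=False))
print(excluded_given(FRAMES, "ZE", "np(acc)", cond="sie", present=True))
```

A purely pairwise summary sees only the first line of the output, which is why an exclusion conditioned on a third argument is lost.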



4.4. Comparison with previous research 



The scores reported in the literature on verb valence extraction are so varied that
hasty conclusions should not be drawn from just a single figure. For example, Brent
(1993) achieved 60% recall and 96% precision in the unsupervised approach. This was
done for English and for a very small set of extracted valence frames (the set counted
only 6 distinct frames). English-based researchers who evaluated their extracted valence
dictionaries against more complex test dictionaries reported the following pairs
of recall/precision: 36%/66% (Briscoe and Carroll, 1997) against the COMLEX and
ANLT dictionaries, 43%/90% (Manning, 1993) against The Oxford Advanced Learner's
Dictionary, and 75%/79% (Carroll and Rooth, 1998) against the same dictionary.

Table II. The comparison of all dictionaries.

(verb, frame)       AE   AE-A   AE-C AE-C+F   AE-F   Bań.   Pol.   Świ.     MV
AE                7877
AE-A               848    983
AE-C               587    645    658
AE-C+F             675    674    658    746
AE-F                 ?      ?      ?      ?      ?
Bań.               857    494    418    469    311   1660
Pol.               699    415    359    400    275    778   1536
Świ.               697    409    363    406    294    766    778   1374
MV                 701    444    394    441    311    992   1004    992   1218
recall            0.58   0.36   0.32   0.36   0.26   0.81   0.82   0.81
precision         0.09   0.45   0.60   0.59   0.75   0.6    0.65   0.72
F                 0.16   0.40   0.42   0.45   0.39   0.69   0.73   0.76

(verb, argument)    AE   AE-A   AE-C AE-C+F   AE-F   Bań.   Pol.   Świ.     MV
AE                4051
AE-A               687    687
AE-C               674    674    674
AE-C+F             735    680    674    735
AE-F               582    527    521    582    582
Bań.              1093    611    603    639    524   1342
Pol.              1033    593    586    623    520    966   1336
Świ.               988    589    581    618    521    907    963   1265
MV                1007    608    600    638    530   1066   1122   1063   1222
recall            0.82   0.50   0.49   0.52   0.43   0.87   0.92   0.87
precision         0.25   0.89   0.89   0.87   0.91   0.79   0.84   0.84
F                 0.38   0.64   0.63   0.65   0.58   0.83   0.88   0.85

(A question mark denotes a cell that is illegible in the source.)

Other factors matter as well. Korhonen (2002, page 77) demonstrated that the results
depend strongly on the filtering method: BHT gives 56%/50%, LLR gives 48%/42%, MLE
gives 58%/75%, and no filtering gives 84%/24%, all methods being frame-based and applied to
the same English data. For Czech, a close relative of Polish, Sarkar and Zeman (2000)
found the recall/precision pair 74%/88% but these were evaluated against a manually
annotated sample of texts rather than against a gold-standard valence dictionary.
Moreover, Sarkar and Zeman acquired valence frames from a manually disambiguated
treebank rather than from raw data, so automatic parsing did not contribute to the
overall error rate.

The closest work to ours is Fast and Przepiórkowski (2005), who regarded their own
work as preliminary. They also processed only a small part of the 250-million-word
IPI PAN Corpus. Approximately 12 million running words were parsed but sentence
parsing was done with a simple 18-rule regular grammar rather than with Świdziński's
grammar. Moreover, the dictionary filtering was done according to several frame-based
methods discussed in the literature and the reference dictionary used was only a small
part of Świdziński (1994): 100 verbs for a training set and another 100 verbs for a test
set. In contrast to our experiment, Fast and Przepiórkowski extracted only non-subject
NPs and PPs. They ignored subjects, np(nom), since almost all verbs subcategorize for
them. The best score in the complete frame extraction they reported was 48% recall
and 49% precision (F = 48%), which was obtained for the supervised version of the
binomial hypothesis test.



So as to come closer to the experimental setup of Fast and Przepiórkowski, we reapplied
all frame filtering schemes to the case when only non-subject NPs and PPs were
retained in the preliminary dictionary AE and the three manually edited dictionaries.
The statistics are provided in Table III. Under these conditions our two-stage filtering
method added to the frame-based BHT is again better than either of these methods
separately: F = 57% for the AE-C+F vs. F = 53% for both the AE-F and AE-C. The
AE-C+F is not only better than the AE-F and AE-C with respect to F-score but it also
contains 15% to 38% more frames. The much higher precision of all these dictionaries than
reported by Fast and Przepiórkowski (2005) may be attributed to the deep sentence
parsing with Swigra and the EM disambiguation. The best recall remains almost the
same (47%) for the AE-C+F dictionary, although we extracted valences from a fourfold
smaller amount of text.



5. Conclusion 

Two new ideas for valence extraction have been proposed and applied to Polish language 
data in this paper. Firstly, we have introduced a two-step scheme for filtering incorrect 
frames. The list of valid arguments was determined for each verb first and then a method 
of combining arguments into frames was found. The two-stage induction was motivated 
by an observation that the argument combination rules, such as co-occurrence matrices, 
are largely independent of the verb. We suppose that this observation is not language- 
specific and the co-occurrence matrix formalism can be easily tailored to improve verb 
valence extraction for many other languages and special datasets (also subdomain cor- 
pora and subdomain valence dictionaries). The second new idea is a simple EM selection 
algorithm, which is a natural baseline method for unsupervised disambiguation tasks 
such as choosing the correct valence frame for a sentence. In our application it helped 
high-precision valence extraction without a large treebank or a probabilistic parser. 

Although the proposed frame filtering technique needs further work to address the
drawbacks noticed in Subsection 4.3 and to improve the overall performance, the present
results are encouraging and suggest that two-step frame filtering is worth developing.
In future work, experiments can be conducted using various schemes of decomposing
the information contained in the sets of valence frames and, due to the scale of the
task, this decomposition should be done to a large extent in an algorithmic way. The
straightforward idea to explore is to express the verb valence information in terms of
n-ary rather than binary relations among verbs and verb arguments, where n > 2.
Subsequently, one can investigate the analogous learning problem and propose a frame-set
reconstruction scheme for the n-ary relations. Are ternary relations sufficient to
describe the valence frame sets? We doubt that relations of irreducibly large arities
appear in human language lexicons since, for example, Halford et al. (1998) observed
that human capacity for processing random n-ary relations depends strongly on the
relation arity.

Table III. The case of source dictionaries restricted to non-subject NPs and PPs.

(verb, frame)       AE   AE-A   AE-C AE-C+F   AE-F   Bań.   Pol.   Świ.     MV
AE                3746
AE-A               695    713
AE-C               533    539    544
AE-C+F             615    585    544    626
AE-F               453      ?      ?      ?      ?
Bań.               827    481    407    463    377   1255
Pol.               693    426    367    412    338    684   1128
Świ.               645    422    368    413    346    662    661    939
MV                 694    455    395    446    372    820    819    797    955
recall            0.73   0.48   0.41   0.47   0.39   0.86   0.86   0.83
precision         0.19   0.64   0.73   0.71   0.82   0.65   0.73   0.85
F                 0.30   0.55   0.53   0.57   0.53   0.74   0.79   0.84

(verb, argument)    AE   AE-A   AE-C AE-C+F   AE-F   Bań.   Pol.   Świ.     MV
AE                2364
AE-A               392    392
AE-C               385    385    385
AE-C+F             415    388    385    415
AE-F               354    327    324    354    354
Bań.               717    353    349    369    322    881
Pol.               659    333    330    346    306    603    813
Świ.               585    323    319    334    296    547    567    715
MV                 633    346    342    360    317    665    685    629    747
recall            0.85   0.46   0.46   0.48   0.42   0.89   0.92   0.84
precision         0.27   0.88   0.89   0.87   0.90   0.75   0.84   0.88
F                 0.41   0.60   0.61   0.62   0.57   0.81   0.88   0.86

(A question mark denotes a cell that is illegible in the source.)

Knowing algebraic constraints on the verb argument combinations is important also
for language resource maintenance. Because our test dictionaries do not list valid
argument combinations extensively, many false positive frames in the two-stage corrected
dictionary were in fact truly positive. Thus, it is advisable to correct gold-standard
dictionaries themselves, for example using a modification of the reconstruction (5).
However, prior to resetting the gold-standard in this way, it must be certain that the
reconstruction process does not introduce linguistically implausible frames. Also for this
reason, the effective complexity of verb-argument and argument-argument relations in
natural language should be investigated thoroughly from a more mathematical point of
view.

Appendix

A. A faster reconstruction of the frame set 

Although there is no need to compute $F(v)$ defined in (5) to verify condition $f \in F(v)$
for a given $f$, the reconstruction $F(v)$ can be computed efficiently if needed for other
purposes. A naive solution suggested by formula (5) is to search through all elements
of the power set $2^{L(v)}$ and to check for each independently whether it is an element of
$F(v)$. However, we can do it faster by applying some dynamic programming.

Firstly, let us enumerate the elements of $L(v) = \{b_1, b_2, \ldots, b_N\}$. In the following, we
will compute the chain of sets $A_0, A_1, \ldots, A_N$ where $A_n = \{(B_n \cap f, B_n \setminus f) \mid f \in F(v)\}$
and $B_n = \{b_1, b_2, \ldots, b_n\}$.

In fact, there is an iteration for this chain:



^o = {(0,0)}, 

(/U{6n},5) 



A^ 



u < 



(/,5U{64) 



{f,g) G An-i, 

VaegM(u)fc„a 

{f,g) G An-i, 
{bn} E(t;), 
V„g/M(^;)b„, 



Once the set $A_N = \{(f, L(v) \setminus f) \mid f \in F(v)\}$ is computed, $F(v)$ can be read off easily.
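
The chain $A_0, A_1, \ldots, A_N$ can be sketched as an incremental construction that extends each partial pair (included arguments, excluded arguments) by one argument at a time and prunes incompatible extensions early. The sketch below is an illustration under stated assumptions, not the paper's exact iteration: the predicates `can_include` and `can_exclude` are hypothetical stand-ins for the conditions of formula (5), and the toy constraints (one required argument, one mutually exclusive pair) are invented for the example.

```python
def reconstruct_frames(arguments, can_include, can_exclude):
    """Build all admissible frames by processing the arguments one by one.

    `partial` plays the role of A_n: a list of pairs (f, g) where f holds the
    arguments included so far and g the arguments excluded so far.
    """
    partial = [(frozenset(), frozenset())]          # A_0 = {(empty, empty)}
    for b in arguments:
        extended = []
        for f, g in partial:
            if can_include(b, f, g):                # branch: b joins the frame
                extended.append((f | {b}, g))
            if can_exclude(b, f, g):                # branch: b stays outside
                extended.append((f, g | {b}))
        partial = extended
    return {f for f, _ in partial}


# Toy constraints (assumptions of this sketch, not taken from the paper):
REQUIRED = {"np(nom)"}                               # must occur in every frame
EXCLUSIVE = {frozenset({"np(acc)", "ZE"})}           # may never occur together


def can_include(b, f, g):
    return all(frozenset({a, b}) not in EXCLUSIVE for a in f)


def can_exclude(b, f, g):
    return b not in REQUIRED


frames = reconstruct_frames(["np(nom)", "np(acc)", "ZE"], can_include, can_exclude)
for frame in sorted(frames, key=lambda s: (len(s), sorted(s))):
    print(sorted(frame))
```

Because inadmissible partial pairs are discarded as soon as they arise, the search does not have to visit every element of the power set.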



B. Parsing of the IPI PAN Corpus 

The input of the valence extraction experiment discussed in this paper came from
the 250-million-word IPI PAN Corpus of Polish (http://korpus.pl/). The original
automatic part-of-speech annotation of the text was removed, since it contained too
many errors, and the sentences from the corpus were analyzed using the Swigra parser
(Woliński, 2004, 2005; see also http://nlp.ipipan.waw.pl/~wolinski/swigra/).

Technically, Swigra utilizes two distinct language resources: (1) Morfeusz, a dictionary
of inflected words (a.k.a. a morphological analyzer) programmed by Woliński (2006)
on the basis of about 20,000 stemming rules compiled by Tokarski (1993), and (2)
GFJP, the formal grammar of Polish written by Świdziński (1992). Świdziński's
grammar is a DCG-like grammar, close to the format of the metamorphosis grammar
by Colmerauer (1978). It counts 461 rules and examples of its parse trees can be found
in Woliński (2004). For the sake of this project, Swigra used a fake valence dictionary
that allowed any verb to take zero or one NP in the nominative (the subject) and any
combination of other arguments.

Only a small subset of sentences was actually selected to be parsed with Swigra. 
The following selection criteria were applied to the whole 250-million-word IPI PAN 
Corpus: 

1. The selected sentence had to contain a word recognized by Morfeusz as a verb and
the verb had to occur ≥ 396 times in the corpus. (396 is the lowest corpus frequency
of a verb from the test set described in Section 4. The threshold was introduced to
speed up parsing without loss of empirical coverage for any verb in the test set. The
selected sentence might contain another less frequent verb if it was a compound
sentence.)

2. The selected sentence could not be longer than 15 words. (We supposed that the
EM selection would find it difficult to select the correct parse for longer sentences.)

3. Maximally 5000 sentences were selected per recognized verb. (We supposed that
a frame which was used less than once per 5000 verb occurrences would not be
considered in the gold-standard dictionaries.)
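
The three criteria above can be sketched as a single filtering pass. This is a minimal illustration, not the original selection script: the input format (pairs of a token list and the list of verb lemmas recognized in the sentence), the function name `select_sentences`, and the way the per-verb cap is applied to sentences with several frequent verbs are all assumptions of this sketch.

```python
from collections import Counter

MIN_VERB_FREQ = 396    # criterion 1: lowest corpus frequency of a test-set verb
MAX_SENT_LEN = 15      # criterion 2: maximal sentence length in words
MAX_PER_VERB = 5000    # criterion 3: maximal number of sentences per verb


def select_sentences(sentences, verb_freq,
                     min_freq=MIN_VERB_FREQ,
                     max_len=MAX_SENT_LEN,
                     max_per_verb=MAX_PER_VERB):
    """Return the token lists of the sentences that pass all three criteria.

    sentences -- iterable of (tokens, verbs) pairs (assumed input format)
    verb_freq -- mapping from a verb lemma to its corpus frequency
    """
    taken = Counter()          # number of selected sentences per verb
    selected = []
    for tokens, verbs in sentences:
        if len(tokens) > max_len:
            continue
        frequent = [v for v in verbs if verb_freq.get(v, 0) >= min_freq]
        # Assumption: a sentence is kept if some frequent verb is still below its cap.
        if not any(taken[v] < max_per_verb for v in frequent):
            continue
        selected.append(tokens)
        for v in frequent:
            taken[v] += 1
    return selected


toy = [
    (["a", "b"], ["x"]),       # kept
    (["a"] * 20, ["x"]),       # too long
    (["c"], ["y"]),            # verb too rare
    (["d"], ["x"]),            # kept
    (["e"], ["x"]),            # cap for "x" already reached
]
picked = select_sentences(toy, {"x": 500, "y": 10}, max_per_verb=2)
print(picked)
```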

In this way, a subset of 1 011 991 sentences (8 727 441 running words) was chosen. They
were all fed to Swigra's input but fewer than half (0.48 million sentences) were parsed
successfully within a preset time of 1 minute per sentence. Detailed statistics are given
in Table IV below. All mentioned thresholds were introduced in advance to compute
only the most useful parse forests in the pre-estimated total time of a few months. It
was the first experiment ever in which Swigra was applied to more than several hundred
sentences. The parsing actually took 2 months on a single PC station.

Not all information contained in the obtained parse forests was relevant for valence 
acquisition. Full parses were subsequently reduced to valence frames plus verbs, as in the
first displayed example in Section 3. First of all, the parse forests for compound sentences
were split into separate parse forests for elementary clauses. Then each parse tree was 
reduced to a string that identifies only the top-most phrases. To decrease the amount 
of noise in the subsequent EM selection and to speed up computation, we decided to 
skip 10% of clauses that had the largest number of reduced parses. As a result, we only 
retained clauses which had < 40 reduced parses. 

To improve the EM selection, we also deleted parses that contained certain syntactically
idiosyncratic words (mostly the indefinite pronouns to (= this), co (= what), and nic
(= nothing)) or highly improbable morphological word interpretations (like the second
interpretation for albo = 1. the conjunction or; 2. the vocative singular of the noun alb,
a kind of liturgical vestment). The stop list of improbable interpretations consisted of
646 word interpretations which never occurred in the SFPW Corpus but were possible
interpretations of the most common words according to Morfeusz. The SFPW Corpus is
a manually POS tagged 0.5-million-word corpus prepared for the frequency dictionary
of 1960s Polish (Kurcz et al., 1990), which was actually commenced in the 1960s but
not published until 1990.



Our format of reduced parses approximates the format of valence frames in Świdziński
(1994), so it diverges from the format proposed by Przepiórkowski (2006). To convert
a parse in Przepiórkowski's format into ours, the transformations must be performed as
follows:

1. Add the dropped personal subject or the impersonal subject expressed by the
ambiguous reflexive marker się when their presence is implied by the verb form.

2. Remove one nominal phrase in the genitive for negated verbs. (An attempt to treat
the genitive of negation.)

3. Transform several frequent adjuncts expressed by nominal phrases.

4. Skip the parse if it contains pronouns to (= this), co (= what), and nic (= nothing).
(Instead of converting these pronouns into regular nominal phrases.)

5. Remove lemmas from non-verbal phrases and sort phrases in alphabetic order.

Table IV. Sizes of the processed parts of the IPI PAN Corpus.

                                                 sentences/clauses       words
sentences sent to Swigra's input                 1 011 991 sentences     8 727 441
sentences successfully parsed with Swigra          481 039 sentences     3 421 863
sentences with < 40 parses split into clauses      569 307 clauses       3 149 391
the final bank of reduced parse forests            510 743 clauses       2 795 357

The resulting bank of reduced parse forests included 510 743 clauses with one or
more proposed valence frames. We parsed successfully only 3.4 million running words of
the whole 250-million-word IPI PAN Corpus, four times fewer than the 12 million words
parsed by Fast and Przepiórkowski (2005). However, our superior results in the valence
extraction task indicate that skipping a fraction of available empirical data is a good
idea if the remaining data can be processed more thoroughly and the skipped portion
does not provide different efficiently usable information.



C. The EM selection algorithm 

Consider the following abstract statistical task. Let $Z_1, Z_2, \ldots, Z_M$, with $Z_i : \Omega \to J$,
be a sequence of discrete random variables and let $Y_1, Y_2, \ldots, Y_M$ be a random sample
of sets, where each set $Y_i : \Omega \to 2^J \setminus \{\emptyset\}$ contains the actual value of $Z_i$, i.e., $Z_i \in Y_i$.
The objective is to guess the conditional distribution of $Z_i$ given an event $(Y_i = A_i)_{i=1}^{M}$,
$A_i \subset J$. In particular, we would like to know the conditionally most likely values of
$Z_i$. The exact distribution of $Y_i$ is not known and infeasible to estimate if we treat the
values of $Y_i$ as atomic entities. We have to solve the task via some rationally motivated
assumptions.

Our heuristic solution was the iteration

\[
p_{ji}^{(n)} =
\begin{cases}
p_j^{(n)} \Big/ \sum_{j' \in A_i} p_{j'}^{(n)}, & j \in A_i, \\
0, & \text{else},
\end{cases}
\tag{11}
\]

\[
p_j^{(n+1)} = \frac{1}{M} \sum_{i=1}^{M} p_{ji}^{(n)}, \tag{12}
\]

with $p_j^{(1)} = 1$. We observed that coefficients $p_{ji}^{(n)}$ converge to a value that can be plausibly
identified with the conditional probability $P(Z_i = j \mid Y_i = A_i)$.

Possible applications of iteration (11)-(12), which we call the EM selection algorithm,
cover unsupervised disambiguation tasks where the number of different values of $Y_i$ is
very large but the internal ambiguity rate (i.e., the typical cardinality $|Y_i|$) is rather small
and the alternative choices within $Y_i$ (i.e., the values of $Z_i$) are highly repeatable. There
may be many applications of this kind in NLP and bioinformatics. To our knowledge,
however, we present the first rigorous treatment of this particular selection problem.

In this appendix, we will show that the EM selection algorithm belongs to the
class of expectation-maximization (EM) algorithms. For this reason, our algorithm
resembles many instances of EM used in NLP, such as the Baum-Welch algorithm for
hidden Markov models (Baum, 1972) or linear interpolation (Jelinek, 1997). However,
normalization (11), which is done over varying sets $A_i$ unlike the typical case of linear
interpolation, is the singular feature of EM selection. The local maxima of the respective
likelihood function also form a convex set, so there is no need to care much for initializing
the iteration (11)-(12), unlike e.g. the Baum-Welch algorithm.
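
Iteration (11)-(12) can be sketched in a few lines. This is a minimal illustration rather than the code used in the experiment: the function name `em_selection`, the fixed number of iterations, and the toy sample are assumptions of this sketch. Each element of `samples` plays the role of one observed set $A_i$ (for instance, the reduced parses proposed for one clause).

```python
from collections import defaultdict


def em_selection(samples, n_iter=50):
    """Run iteration (11)-(12) on a list of non-empty sets A_1, ..., A_M.

    Returns the coefficients p_j and, for every sample, the normalized
    coefficients p_ji restricted to the alternatives in that sample.
    """
    m = len(samples)
    values = set().union(*samples)
    p = {j: 1.0 for j in values}                    # initial value p_j^(1) = 1
    for _ in range(n_iter):
        new_p = defaultdict(float)
        for a in samples:
            norm = sum(p[j] for j in a)             # denominator of (11)
            for j in a:
                new_p[j] += p[j] / norm / m         # accumulate (12)
        p = {j: new_p[j] for j in values}
    posteriors = []
    for a in samples:
        norm = sum(p[j] for j in a)
        posteriors.append({j: p[j] / norm for j in a})
    return p, posteriors


# Toy sample: "a" is unambiguous twice, so it should also win the ambiguous case.
samples = [{"a", "b"}, {"a"}, {"a"}]
p, posteriors = em_selection(samples)
best = max(posteriors[0], key=posteriors[0].get)
print(best)
```

In this toy run the coefficient of "b" shrinks by a factor of about three in every iteration, so the ambiguous first sample is resolved in favour of "a".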



To begin with, we recall the universal scheme of EM (Dempster et al., 1977;
Neal and Hinton, 1999). Let $P(Y \mid \theta)$ be a likelihood function, where $Y$ is an observed
variable and $\theta$ is an unknown parameter. For the observed value $Y$, the maximum
likelihood estimator of $\theta$ is

\[
\theta_{\mathrm{MLE}} = \arg\max_{\theta} P(Y \mid \theta).
\]

When the direct maximization is impossible, we may consider a latent discrete variable
$Z$ and function

\[
Q(\theta', \theta'') = \sum_{z} P(Z = z \mid Y, \theta') \log P(Z = z, Y \mid \theta''),
\]

which is a kind of cross entropy function. The EM algorithm consists of setting an initial
parameter value $\theta_1$ and iterating

\[
\theta_{n+1} = \arg\max_{\theta} Q(\theta_n, \theta) \tag{13}
\]

until a sufficient convergence of $\theta_n$ is achieved. It is a general fact that $P(Y \mid \theta_{n+1}) \ge
P(Y \mid \theta_n)$ but EM is worth considering only if maximization (13) is easy.

Having outlined the general EM algorithm, we come back to the selection problem.
The observed variable is $Y = (Y_1, Y_2, \ldots, Y_M)$, the latent one is $Z = (Z_1, Z_2, \ldots, Z_M)$,
whereas the parameter seems to be $\theta_n = (p_j^{(n)})_{j \in J}$. The appropriate likelihood function
remains to be determined. We may suppose from the problem statement that it
factorizes into $P(Z, Y \mid \theta) = \prod_i P(Z_i, Y_i \mid \theta)$. Hence $Q(\theta', \theta'')$ takes the form

\[
Q(\theta', \theta'') = \sum_{i} \sum_{j} P(Z_i = j \mid Y_i = A_i, \theta') \log P(Z_i = j, Y_i = A_i \mid \theta'').
\]

Assume now

\[
P(Y_i = A \mid Z_i = j, \theta) =
\begin{cases}
g(A), & j \in A, \\
0, & \text{else},
\end{cases}
\tag{14}
\]

\[
P(Z_i = j \mid \theta) = p_j \tag{15}
\]

for $\theta = (p_j)_{j \in J}$ and a parameter-free function $g(\cdot)$ satisfying

\[
\sum_{A \subset J} \mathbf{1}\{j \in A\}\, g(A) = 1, \quad \forall j \in J, \tag{16}
\]

where

\[
\mathbf{1}\{\phi\} =
\begin{cases}
1, & \phi \text{ is true}, \\
0, & \text{else}.
\end{cases}
\]

For example, let $g(A) = q^{|A|-1} (1-q)^{|J|-|A|}$, where $|A|$ stands for the cardinality of set
$A$ and $0 < q < 1$ is a fixed number not incorporated into $\theta$. Then the cardinalities of
sets $Y_i$ are binomially distributed, i.e., $P(|Y_i| - 1 \mid \theta) \sim B(|J| - 1, q)$. This particular form
of $g(A)$, however, is not necessary to satisfy (16).
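
That the binomial-type example satisfies the normalization condition (16) can be verified by brute force for a small set $J$. The following is only a numerical sanity check under stated assumptions: the function names and the particular choice of $J$ and $q$ are illustrative.

```python
from itertools import combinations


def g(size_a, size_j, q):
    """The example weight g(A) = q^(|A|-1) * (1-q)^(|J|-|A|)."""
    return q ** (size_a - 1) * (1 - q) ** (size_j - size_a)


def normalisation(j, values, q):
    """Left-hand side of (16): the sum of g(A) over all subsets A that contain j."""
    others = [x for x in values if x != j]
    total = 0.0
    for k in range(len(others) + 1):
        for _rest in combinations(others, k):       # A = {j} plus k other elements
            total += g(k + 1, len(values), q)
    return total


J = ["a", "b", "c", "d"]
sums = [normalisation(j, J, q=0.3) for j in J]
print(all(abs(s - 1.0) < 1e-12 for s in sums))      # True
```

The sum equals one because grouping the subsets by the number of additional elements yields the binomial expansion of $(q + (1-q))^{|J|-1}$.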

The model (14)-(15) is quite speculative. In the main part of this article, we need to model the probability distribution of the reduced parse forest $Y_i$ under the assumption that the correct parse $Z_i$ is an arbitrary element of $Y_i$. In particular, we have to imagine what $P(Y_i = A \mid Z_i = j, \theta)$ is like if $j$ is a semantically implausible parse. We circumvent the difficulty by saying in (14) that this quantity is the same as if $j$ were the correct parse.

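Assumption (14) leads to an EM algorithm which does not depend on the specific choice of function $g(\cdot)$. Therefore the algorithm is rather generic. In fact, (14) assures that $P(Y_i = A_i \mid \theta) = g(A_i)\, P(Z_i \in A_i \mid \theta)$ and

$$P(Z_i = j \mid Y_i = A_i, \theta) = P(Z_i = j \mid Z_i \in A_i, \theta).$$

In consequence, iteration (13) is equivalent to

\begin{align}
0 &= \left. \frac{\partial}{\partial p_j} \left[ Q(\theta_n, \theta) - \lambda \Bigl( \sum_{j' \in J} p_{j'} - 1 \Bigr) \right] \right|_{\theta = \theta_{n+1}} \tag{17} \\
  &= \frac{\sum_{i=1}^{M} p_{ji}^{(n)}}{p_j^{(n+1)}} - \lambda, \tag{18}
\end{align}

where $p_{ji}^{(n)} = P(Z_i = j \mid Z_i \in A_i, \theta_n)$ is given exactly by (11).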

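If the Lagrange multiplier $\lambda$ is assigned the value that satisfies the constraint $\sum_{j \in J} p_j^{(n+1)} = 1$ (namely $\lambda = M$, since $\sum_{j} p_{ji}^{(n)} = 1$ for each $i$), then equation (18) simplifies to (12). Hence iteration (11)-(12) locally maximizes the log-likelihood

$$L(\theta) := \log P\bigl( (Y_i = A_i)_{i=1}^{M} \mid \theta \bigr) = \log \left[ \prod_{i=1}^{M} P(Z_i \in A_i \mid \theta)\, g(A_i) \right], \tag{19}$$

or simply $L^{(n+1)} \ge L^{(n)}$ for

$$L^{(n)} := L(\theta_n) - \sum_{i=1}^{M} \log g(A_i) = \sum_{i=1}^{M} \log \left[ \sum_{j \in A_i} p_j^{(n)} \right], \quad n \ge 2.$$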
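The iteration (11)-(12) and the monotone growth of the log-likelihood can be illustrated with a short program. This is a sketch under stated assumptions: the function name `em_selection`, the uniform initialization and the toy forests are hypothetical choices made for illustration and are not taken from the experiments in this paper.

```python
import math


def em_selection(forests, alternatives, n_iter=50):
    """Run iteration (11)-(12) and record the log-likelihood L^(n) per step.

    forests      -- list of non-empty sets A_i of alternatives
    alternatives -- list of all alternatives j in J
    """
    M = len(forests)
    # Uniform starting point (an assumption of this sketch).
    p = {j: 1.0 / len(alternatives) for j in alternatives}
    trace = []
    for _ in range(n_iter):
        # L^(n) = sum_i log sum_{j in A_i} p_j^(n)
        trace.append(sum(math.log(sum(p[j] for j in A)) for A in forests))
        new_p = {j: 0.0 for j in alternatives}
        for A in forests:
            norm = sum(p[j] for j in A)
            for j in A:
                # (11): p_ji = p_j / sum_{j' in A_i} p_j'; (12): average over i
                new_p[j] += p[j] / norm / M
        p = new_p
    return p, trace


# Hypothetical toy data: four observed sets over three alternatives.
forests = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}]
p, trace = em_selection(forests, alternatives=["a", "b", "c"])
print(p)
print(trace[0], trace[-1])
```

The recorded values in `trace` should be non-decreasing, and the entries of `p` should sum to one after every step.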



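Moreover, there is no need to care about the initialization of iteration (11)-(12), since the local maxima of function (19) form a convex set $\mathcal{M}$, i.e., $\theta, \theta' \in \mathcal{M} \implies q\theta + (1 - q)\theta' \in \mathcal{M}$ for $0 < q < 1$. Hence that function is, of course, constant on $\mathcal{M}$. To show this, observe that the domain of log-likelihood (19) is a convex compact set $\mathcal{P} = \bigl\{ \theta : \sum_j p_j = 1,\ p_j \ge 0 \bigr\}$. The second derivative of $L$ reads

$$L_{jj'}(\theta) := \frac{\partial^2 L(\theta)}{\partial p_j\, \partial p_{j'}} = - \sum_{i=1}^{M} \frac{\mathbf{1}\{j \in A_i\}\, \mathbf{1}\{j' \in A_i\}}{\bigl( \sum_{j'' \in A_i} p_{j''} \bigr)^2}.$$

Since matrix $\{L_{jj'}\}$ is negative semidefinite, i.e., $\sum_{j,j'} a_j L_{jj'} a_{j'} \le 0$ for every real vector $(a_j)_{j \in J}$, function $L$ is concave.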
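The sign of the quadratic form can be probed numerically. The sketch below builds the matrix of second derivatives of the sum of log P(Z_i in A_i) terms for a small hypothetical example (alternatives indexed 0, 1, 2; the forests and the point p are assumptions) and evaluates the quadratic form on random vectors. It also shows a direction on which the form vanishes, so the inequality is not strict in general.

```python
import random


def hessian(forests, p):
    """Second derivatives of sum_i log(sum_{j in A_i} p_j) with respect to p."""
    n = len(p)
    H = [[0.0] * n for _ in range(n)]
    for A in forests:
        s = sum(p[j] for j in A)
        for j in A:
            for k in A:
                H[j][k] -= 1.0 / s ** 2
    return H


def quad_form(H, a):
    """Evaluate sum_{j,k} a_j * H[j][k] * a_k."""
    n = len(a)
    return sum(a[j] * H[j][k] * a[k] for j in range(n) for k in range(n))


# Hypothetical toy setting: two observed sets over alternatives 0, 1, 2.
forests = [{0, 1}, {1, 2}]
p = [0.2, 0.3, 0.5]
H = hessian(forests, p)

random.seed(0)
values = [quad_form(H, [random.uniform(-1, 1) for _ in p]) for _ in range(1000)]
print(max(values))                # never positive, up to rounding error
print(quad_form(H, [1, -1, 1]))   # approximately zero: a flat direction
```

The vector (1, -1, 1) sums to zero over each observed set, which is why the quadratic form vanishes on it.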

As a general fact, a continuous function $L$ achieves its supremum on a compact set $\mathcal{P}$ (Rudin, 1974, Theorem 2.10). If additionally $L$ is concave and its domain $\mathcal{P}$ is convex, then the local maxima of $L$ form a convex set $\mathcal{M} \subset \mathcal{P}$, where $L$ is constant and achieves its supremum (Boyd and Vandenberghe, 2004, Section 4.2.2).
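A small experiment makes the role of the convex set of maxima concrete. In the sketch below, the toy forests, the two starting points and the helper names are assumptions chosen so that the maximizer is not unique: the iteration (11)-(12) ends at different parameter values depending on the start, yet the log-likelihood value is the same, and it is also the same at the midpoint of the two limits.

```python
import math


def em_step(forests, p):
    """One pass of iteration (11)-(12) for probabilities stored in a list."""
    M = len(forests)
    new_p = [0.0] * len(p)
    for A in forests:
        norm = sum(p[j] for j in A)
        for j in A:
            new_p[j] += p[j] / norm / M
    return new_p


def log_lik(forests, p):
    """sum_i log(sum_{j in A_i} p_j), i.e. (19) without the g(A_i) terms."""
    return sum(math.log(sum(p[j] for j in A)) for A in forests)


def run(forests, p, n_iter=200):
    for _ in range(n_iter):
        p = em_step(forests, p)
    return p


# Hypothetical toy data: alternatives 0 and 1 always occur together.
forests = [{0, 1}, {0, 1}, {0, 1, 2}]
p_a = run(forests, [0.6, 0.2, 0.2])
p_b = run(forests, [0.2, 0.6, 0.2])
p_mid = [(x + y) / 2 for x, y in zip(p_a, p_b)]

print(p_a, log_lik(forests, p_a))
print(p_b, log_lik(forests, p_b))
print(p_mid, log_lik(forests, p_mid))
```

In this example the ratio of the first two probabilities is preserved by every step, so the two runs approach (0.75, 0.25, 0) and (0.25, 0.75, 0) respectively, while the log-likelihood approaches 0 in both cases.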